Skip to content

Conversation

@david-sh-csiro
Copy link
Collaborator

Updated hash key generation and added tests to validate the fix.

…added tests to validate fix and updated the development change notes.
@david-sh-csiro david-sh-csiro linked an issue Dec 12, 2024 that may be closed by this pull request
Copy link
Contributor

@mx-moth mx-moth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue seems to be due to pydata/xarray#2436

# A float array and an int array mean very different things,
# but could have identical byte patterns.
hash_string(hash, data_array.encoding['dtype'].name)
hash_string(hash, data_array.encoding.get('dtype', data_array.values.dtype).name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dtype encoding value should be present most of the time. It being missing is an unexpected exception to the norm. Without the context of this bug checking data_array.values.dtype looks unnecessary.

Please add brief one or two line description of why we are doing this extra step, and include a link to the xarray bug report tracking this issue: pydata/xarray#2436

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a comment with issue link and a simple description.

assert result_cache_key_cf == cache_key_hash_cf1d_sha1


def test_cache_key_with_missing_data_array_encoding_type(datasets: pathlib.Path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is too specific. What we actually care about is whether datasets opened via xarray.open_mfdataset() produce a valid geometry hash and we still don't have a test for that. Please update this test to ensure that a dataset opened via xarray.open_mfdataset() produces a valid hash. This means the test will continue to pass even if xarray fix the issue with data_array.encoding being empty, and will fail if some other issue with xarray.open_mfdataset() pops up.

Copy link
Collaborator Author

@david-sh-csiro david-sh-csiro Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added multifile dataset tests. I've included both ugrid and cfgrid specific tests. The cfgrid fixture does lose the encoding as expected, but the test should handle both cases for if and when xarray fixes the issue.

@david-sh-csiro david-sh-csiro changed the title Updated hash key generation to handling missing encoding dtype. Updated hash key generation to handle missing encoding dtype. Dec 12, 2024
Copy link
Contributor

@mx-moth mx-moth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@david-sh-csiro david-sh-csiro merged commit 0579a7c into main Jan 21, 2025
15 checks passed
@david-sh-csiro david-sh-csiro deleted the 166-multifile-datasets-dont-work-with-cache-key-generation branch January 21, 2025 03:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Multifile Datasets don't work with cache key generation

3 participants